BUILDING AN EFFICIENT, SCALABLE, AND TRAINABLE PROBABILITY-AND-RULE-BASED PART-OF-SPEECH TAGGER OF HIGH ACCURACY
Abstract
This project aims to build an efficient, scalable, portable, and trainable part-of-speech tagger. Using 98% of Penn Treebank-3 as training data, it first builds a raw tagger based on Bayes’ theorem, a hidden Markov model, and the Viterbi algorithm. A reinforcement machine learning algorithm and contextual transformation rules are then applied to increase the tagger’s accuracy. The tagger’s final accuracy on the testing data is 96.51%, and its speed is about 26,000 words per second on a computer with two gigabytes of random access memory and two 3.00 GHz Pentium duo processors. The tagger’s portability and trainability are demonstrated by the taggermaker’s success in building a new tagger from a corpus annotated with a tagset different from that of Penn Treebank.
INDEX WORDS: Part-of-Speech, Tagging, Markov Model, The Viterbi Algorithm, Bayes’ Theorem, Machine Learning, Contextual Rules, Natural Language Processing
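The pipeline named in the abstract (a hidden Markov model decoded with the Viterbi algorithm, followed by contextual transformation rules) can be illustrated with a minimal sketch. This is not the project's implementation: the bigram model, the add-one smoothing, the toy corpus, the rule format ("change tag A to tag B when the previous tag is C"), and every function name below are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def train(tagged_sentences):
    """Count tag-transition and word-emission frequencies (bigram HMM, assumed)."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return trans, emit

def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability (the smoothing choice is an assumption)."""
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + 1) / (total + n_outcomes))

def viterbi(words, trans, emit):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    if not words:
        return []
    tags = list(emit)
    n_tags = len(tags)
    # Vocabulary size plus one slot reserved for unknown words.
    n_words = len({w for t in emit for w in emit[t]}) + 1
    # best[i][tag] = (log score of the best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(trans["<s>"], tag, n_tags)
                 + log_prob(emit[tag], words[0].lower(), n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            e = log_prob(emit[tag], words[i].lower(), n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(trans[p], tag, n_tags) + e, p)
                for p in tags)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Contextual rule (assumed format): from_tag -> to_tag after prev_tag."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

# Toy corpus, invented for this example.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
trans, emit = train(corpus)
raw_tags = viterbi(["the", "cat", "barks"], trans, emit)
fixed_tags = apply_rule(["TO", "NN"], "NN", "VB", "TO")
```

In this sketch the Viterbi pass plays the role of the raw tagger and `apply_rule` stands in for a single correction step; how the actual system learns and orders its rules is not described in the abstract.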
Related Resources
Active Incremental Recognition of Human Activities in a Streaming Context
Recognising human activities from streaming sources poses unique challenges to learning algorithms. Predictive models need to be scalable, incrementally trainable, and must remain bounded in size even when the data stream is arbitrarily long. In order to achieve high accuracy even in complex and dynamic environments methods should be also nonparametric, i.e., their structure should adapt in res...
The Hidden Information State Dialogue Manager: A Real-World POMDP-Based System
The Hidden Information State (HIS) Dialogue System is the first trainable and scalable implementation of a spoken dialog system based on the Partially Observable Markov Decision Process (POMDP) model of dialogue. The system responds to n-best output from the speech recogniser, maintains multiple concurrent dialogue state hypotheses, and provides a visual display showing how competing hypotheses ...
Trainable High Resolution Melt Curve Machine Learning Classifier for Large-Scale Reliable Genotyping of Sequence Variants
High resolution melt (HRM) is gaining considerable popularity as a simple and robust method for genotyping sequence variants. However, accurate genotyping of an unknown sample for which a large number of possible variants may exist will require an automated HRM curve identification method capable of comparing unknowns against a large cohort of known sequence variants. Herein, we describe a new ...
Trainable, Scalable Summarization Using Robust NLP and Machine Learning
We describe a trainable and scalable summarization system which utilizes features derived from information retrieval, information extraction, and NLP techniques and on-line resources. The system combines these features using a trainable feature combiner learned from summary examples through a machine learning algorithm. We demonstrate system scalability by reporting results on the best combin...
Trainable and Dynamic Computing: Error Backpropagation through Physical Media
Machine learning algorithms, and more in particular neural networks, arguably experience a revolution in terms of performance. Currently, the best systems we have for speech recognition, computer vision and similar problems are based on neural networks, trained using the half-century old backpropagation algorithm. Despite the fact that neural networks are a form of analog computers, they are st...